ContextVQA Multimodal QA Data Generation Pipeline
2026-01-24
1. Overview
The ContextVQA Multimodal QA Data Generation Pipeline automatically generates context-based visual question answering (ContextVQA) data with external knowledge contexts, starting from images. The pipeline uses a Vision-Language Model (VLM) to generate a Wikipedia-style article related to each image, together with corresponding QA pairs, which are then parsed into structured data.
We support the following application scenarios:
- Knowledge-based VQA Data Synthesis: Building QA datasets that require external knowledge reasoning.
- Multimodal RAG Data Construction: Generating high-quality data for training Retrieval-Augmented Generation (RAG) systems.
- Visual Reasoning Training: Generating questions that point to the image but require answers reasoned from the textual context.
The main flow includes:
- Data Loading: Reading data files containing image paths.
- Context and QA Generation: Utilizing a locally deployed VLM to generate Wikipedia-style articles and raw QA pairs based on the image.
- Data Cleaning and Structuring: Parsing raw text to extract a structured {context, qas} format.
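The target of step 3 can be sketched as a plain Python dict; the field names below are taken from this pipeline's own output examples:

```python
# Illustrative shape of one parsed record (field names from this pipeline's output format).
record = {
    "image": ["./example_data/image_contextvqa/person.png"],
    "context_vqa": {
        "context": "Nightmare Alley is a 2021 American psychological thriller film...",
        "qas": [
            {
                "question": "What genre does this film belong to?",
                "answer": "Psychological thriller",
            },
        ],
    },
}

# Every QA pair should be answerable from the context alone.
assert all({"question", "answer"} <= set(qa) for qa in record["context_vqa"]["qas"])
```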
2. Quick Start
Step 1: Create a New DataFlow Work Folder
mkdir run_dataflow_mm
cd run_dataflow_mm
Step 2: Initialize DataFlow-MM
dataflowmm init
You will now see:
gpu_pipelines/context_vqa.py
Step 3: Download Example Data
huggingface-cli download --repo-type dataset OpenDCAI/dataflow-demo-image --local-dir example_data
Step 4: Configure Model and Data Paths
Modify the class initialization parameters directly in context_vqa.py (no longer passed via command line arguments):
# Model Serving Configuration
self.vlm_serving = LocalModelVLMServing_vllm(
    hf_model_name_or_path="Qwen/Qwen2.5-VL-3B-Instruct",
    hf_cache_dir="~/.cache/huggingface",
    hf_local_dir="./ckpt",
    vllm_tensor_parallel_size=1,
    vllm_max_tokens=512,
)

# Data Storage Configuration
self.storage = FileStorage(
    first_entry_file_name="./example_data/image_contextvqa/sample_data.json",
    cache_path="./cache_local",
    file_name_prefix="context_vqa",
    cache_type="json",
)
Step 5: One-Click Run
python gpu_pipelines/context_vqa.py
3. Data Flow and Pipeline Logic
1. Input Data
Input data is managed through FileStorage, which supports resuming an interrupted run from cached intermediate results (breakpoint resumption).
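FileStorage's resumption mechanism is internal to DataFlow and not shown in this document. As a rough illustration only (a hypothetical helper, not the library's API), cache-based resumption amounts to skipping any step whose output file already exists in the cache directory:

```python
import json
import os

def run_step(step_name: str, cache_path: str, compute):
    """Hypothetical sketch: reuse a cached step output if present,
    otherwise compute it and write it to the cache."""
    out_file = os.path.join(cache_path, f"context_vqa_{step_name}.json")
    if os.path.exists(out_file):  # resume: this step already completed
        with open(out_file) as f:
            return json.load(f)
    result = compute()  # first run: do the expensive work
    os.makedirs(cache_path, exist_ok=True)
    with open(out_file, "w") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
    return result
```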
Input Data Example (sample_data.json):
[
  {
    "image": ["./example_data/image_contextvqa/person.png"],
    "conversation": [
      {
        "from": "human",
        "value": "Write a Wikipedia article related to this image without directly referring to the image. Then write question answer pairs..."
      }
    ]
  }
]
2. Core Operator Logic
A. PromptedVQAGenerator (Context Generation)
This operator calls the local VLM model to generate raw text based on built-in Wikipedia-style prompt templates.
Operator Execution:
self.vqa_generator.run(
    storage=self.storage.step(),
    input_conversation_key="conversation",
    input_image_key=input_image_key,
    output_answer_key=output_answer_key,
)
B. WikiQARefiner (Result Parsing)
This operator cleans the unstructured text generated by the VLM and converts it into a standard format, separating the article content (Context) from the question-answer pairs (QAs).
Operator Execution:
self.refiner.run(
    storage=self.storage.step(),
    input_key="vqa",          # Raw text from the previous step
    output_key="context_vqa"  # Final structured data
)
3. Output Data
The final structured data includes context (article) and qas (list of questions and answers).
Output Data Example:
{
  "id": 1,
  "image": ["./example_data/image_contextvqa/person.png"],
  "context_vqa": {
    "context": "Nightmare Alley is a 2021 American psychological thriller film...",
    "qas": [
      {
        "question": "What genre does this film belong to?",
        "answer": "Psychological thriller"
      }
    ]
  }
}
4. Pipeline Example
Below is the complete ContextVQAPipeline code implementation.
from dataflow.utils.storage import FileStorage
from dataflow.core import LLMServingABC
from dataflow.serving.local_model_vlm_serving import LocalModelVLMServing_vllm
from dataflow.operators.core_vision import PromptedVQAGenerator, WikiQARefiner


class ContextVQAPipeline:
    """
    Batch generate ContextVQA data for images with a single command.
    """

    def __init__(self, llm_serving: LLMServingABC = None):
        # ---------- 1. Storage ----------
        self.storage = FileStorage(
            first_entry_file_name="./example_data/image_contextvqa/sample_data.json",
            cache_path="./cache_local",
            file_name_prefix="context_vqa",
            cache_type="json",
        )
        # ---------- 2. Serving ----------
        self.vlm_serving = LocalModelVLMServing_vllm(
            hf_model_name_or_path="Qwen/Qwen2.5-VL-3B-Instruct",
            hf_cache_dir="~/.cache/huggingface",
            hf_local_dir="./ckpt",
            vllm_tensor_parallel_size=1,
            vllm_temperature=0.7,
            vllm_top_p=0.9,
            vllm_max_tokens=512,
        )
        # ---------- 3. Operator ----------
        self.vqa_generator = PromptedVQAGenerator(
            serving=self.vlm_serving,
            system_prompt="You are a helpful assistant.",
        )
        self.refiner = WikiQARefiner()

    # ------------------------------------------------------------------ #
    def forward(self):
        input_image_key = "image"
        output_answer_key = "vqa"
        output_wiki_key = "context_vqa"
        self.vqa_generator.run(
            storage=self.storage.step(),
            input_conversation_key="conversation",
            input_image_key=input_image_key,
            output_answer_key=output_answer_key,
        )
        self.refiner.run(
            storage=self.storage.step(),
            input_key=output_answer_key,
            output_key=output_wiki_key,
        )


# ---------------------------- CLI Entry ------------------------------- #
if __name__ == "__main__":
    pipe = ContextVQAPipeline()
    pipe.forward()
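The internals of WikiQARefiner are not shown above. As an illustration only (a hypothetical parser, not the library's implementation), splitting raw VLM text of the form "article text... Question: ... Answer: ..." into the {context, qas} structure could look like:

```python
import re

def parse_wiki_qa(raw_text: str) -> dict:
    """Hypothetical sketch: split raw VLM output into an article ("context")
    and a list of question/answer pairs ("qas")."""
    # Everything before the first "Question:" marker is treated as the article.
    first_q = raw_text.find("Question:")
    context = raw_text[:first_q].strip() if first_q != -1 else raw_text.strip()
    # Each "Question: ... Answer: ..." span becomes one QA pair.
    qas = [
        {"question": q.strip(), "answer": a.strip()}
        for q, a in re.findall(
            r"Question:\s*(.*?)\s*Answer:\s*(.*?)(?=Question:|$)", raw_text, re.S
        )
    ]
    return {"context": context, "qas": qas}
```

A robust refiner would also need to handle model outputs that deviate from this template (missing markers, numbered questions, markdown headers), which is exactly the cleaning work the operator encapsulates.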
