ContextVQA Multimodal QA Data Generation Pipeline (API Version)
2026-01-24
1. Overview
The ContextVQA Multimodal QA Data Generation Pipeline (API Version) is designed to automatically generate visual question answering data with external knowledge context (Context-based VQA) starting from an image. This pipeline uses a Vision-Language Model (VLM) via API to generate Wikipedia-style articles and QA pairs, which are then parsed into structured data. This is ideal for building knowledge-based VQA and multimodal RAG (Retrieval-Augmented Generation) datasets.
We support the following application scenarios:
- Knowledge-based VQA Data Synthesis: Constructing QA datasets that require external knowledge reasoning.
- Multimodal RAG Data Construction: Generating high-quality data for training retrieval-augmented generation models.
- Visual Reasoning Training: Generating questions that point to an image but require answers derived from textual context reasoning.
The main flow of the pipeline includes:
- Data Loading: Reading data files containing image paths.
- Context and QA Generation: Using a VLM API to generate Wikipedia-style articles and raw QA pairs based on images.
- Data Cleaning and Structuring: Parsing raw text to extract a structured `{context, qas}` format.
2. Quick Start
Step 1: Configure API Key
Set the API Key environment variable in your script:
```python
import os
os.environ["DF_API_KEY"] = "sk-xxx"
```

Step 2: Create a New DataFlow Work Folder

```shell
mkdir run_dataflow
cd run_dataflow
```

Step 3: Initialize DataFlow-MM

```shell
dataflowmm init
```

You will see the following file created:

```
api_pipelines/image_contextvqa.py
```

Step 4: Download Example Data

```shell
huggingface-cli download --repo-type dataset OpenDCAI/dataflow-demo-image --local-dir example_data
```
In `image_contextvqa.py`, configure the API service and the input data paths (no argparse is required; modify the default paths directly in the code):
```python
self.vlm_serving = APIVLMServing_openai(
    api_url="http://172.96.141.132:3001/v1",  # Any OpenAI-compatible API platform
    key_name_of_api_key="DF_API_KEY",         # Name of the API key env var set in Step 1
    model_name="gpt-5-nano-2025-08-07",
    image_io=None,
    send_request_stream=False,
    max_workers=10,
    timeout=1800
)
self.storage = FileStorage(
    first_entry_file_name="./example_data/image_contextvqa/sample_data.json",
    cache_path="./cache_local",
    file_name_prefix="context_vqa",
    cache_type="json",
)
```

Step 6: One-Click Run

```shell
python api_pipelines/image_contextvqa.py
```

3. Data Flow and Pipeline Logic
1. Input Data
The input data for this process mainly includes the following fields:
- image: Path to the image file (local path or URL).
- id (Optional): Unique identifier for the data.
- conversation (Optional): Text in dialogue format used to supplement context generation.
Data is managed through FileStorage, which supports breakpoint resumption.
Input Data Example:
```json
[
    {
        "image": ["./example_data/image_contextvqa/person.png"],
        "conversation": [
            {
                "from": "human",
                "value": "Write a Wikipedia article related to this image without directly referring to the image..."
            }
        ]
    }
]
```

2. Core Operator Logic
This pipeline completes the task by chaining two core operators:
A. PromptedVQAGenerator (Context Generation)
This operator is responsible for calling the VLM API to generate raw text based on a prompt template.
Features:
- Generates a Wikipedia-style popular science article based on the image.
- Generates QA pairs based on the article.
- Prompt Constraints: Questions refer to the image but avoid naming the objects in it; answers come from the article rather than from objects visible in the image; answers are concise.
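As a concrete illustration, a prompt implementing these constraints might look like the following. The wording here is hypothetical; the pipeline's actual prompt is supplied via the `conversation` field of the input data.

```python
# Hypothetical prompt text illustrating the constraints above; the real
# prompt ships inside the example data's "conversation" field.
CONTEXT_VQA_PROMPT = (
    "Write a Wikipedia-style article related to this image without directly "
    "referring to the image. Then write 3 question-answer pairs such that: "
    "(1) each question points to the image but never names the objects in it; "
    "(2) each answer comes from the article, not from objects visible in the image; "
    "(3) each answer is concise (a few words at most)."
)
```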
Operator Execution:
```python
self.vqa_generator.run(
    storage=self.storage.step(),
    input_conversation_key="conversation",
    input_image_key="image",
    output_answer_key="vqa"
)
```

B. WikiQARefiner (Result Parsing)
This operator cleans the raw text generated by the VLM and converts it into a standard format.
Features:
- Cleans Markdown formatting and extra whitespace.
- Separates the article content (Context) from the QA pairs (QAs).
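For intuition, the cleaning and splitting step could be sketched as follows. This is a simplified stand-in; the actual `WikiQARefiner` implementation may use different heuristics.

```python
import re

def parse_wiki_qa(raw: str) -> dict:
    """Sketch of the refining step: strip Markdown markers, then split
    the article (context) from 'Question:'/'Answer:' pairs (qas)."""
    text = re.sub(r"[*#]+", "", raw).strip()  # drop Markdown emphasis/headers
    first_q = text.find("Question:")
    context = text[:first_q].strip() if first_q != -1 else text
    qas = [
        {"question": q.strip(), "answer": a.strip()}
        for q, a in re.findall(
            r"Question:\s*(.+?)\s*Answer:\s*(.+?)(?=Question:|$)", text, re.S
        )
    ]
    return {"context": context, "qas": qas}
```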
Operator Execution:
```python
self.refiner.run(
    storage=self.storage.step(),
    input_key="vqa",
    output_key="context_vqa"
)
```

3. Output Data
The final output data generated by the pipeline will contain:
- image: Original image path.
- vqa: Raw text generated by the VLM (intermediate result).
- context_vqa: Final structured result containing `context` (article) and `qas` (QA list).
Output Data Example:
```json
[
    {
        "image": ["./example_data/image_contextvqa/person.png"],
        "context_vqa": {
            "context": "**Wikipedia Article:** *Nightmare Alley* is a 2021 American psychological thriller...",
            "qas": [
                {
                    "question": "What genre does this film belong to?",
                    "answer": "Psychological thriller"
                }
            ]
        }
    }
]
```

4. Pipeline Example
Below is the complete ContextVQAPipeline implementation.
```python
import os
# Set API Key environment variable
os.environ["DF_API_KEY"] = "sk-xxx"

from dataflow.utils.storage import FileStorage
from dataflow.core import LLMServingABC
from dataflow.serving.api_vlm_serving_openai import APIVLMServing_openai
from dataflow.operators.core_vision import PromptedVQAGenerator
from dataflow.operators.core_vision import WikiQARefiner


class ContextVQAPipeline:
    """
    Generate batch ContextVQA data for images with a single command.
    """

    def __init__(self, llm_serving: LLMServingABC = None):
        # ---------- 1. Storage ----------
        self.storage = FileStorage(
            first_entry_file_name="./example_data/image_contextvqa/sample_data.json",
            cache_path="./cache_local",
            file_name_prefix="context_vqa",
            cache_type="json",
        )
        # ---------- 2. Serving ----------
        self.vlm_serving = APIVLMServing_openai(
            api_url="http://172.96.141.132:3001/v1",
            key_name_of_api_key="DF_API_KEY",
            model_name="gpt-5-nano-2025-08-07",
            image_io=None,
            send_request_stream=False,
            max_workers=10,
            timeout=1800
        )
        # ---------- 3. Operators ----------
        self.vqa_generator = PromptedVQAGenerator(
            serving=self.vlm_serving,
            system_prompt="You are a helpful assistant."
        )
        self.refiner = WikiQARefiner()

    def forward(self):
        input_image_key = "image"
        output_answer_key = "vqa"
        output_wiki_key = "context_vqa"
        self.vqa_generator.run(
            storage=self.storage.step(),
            input_conversation_key="conversation",
            input_image_key=input_image_key,
            output_answer_key=output_answer_key,
        )
        self.refiner.run(
            storage=self.storage.step(),
            input_key=output_answer_key,
            output_key=output_wiki_key
        )


if __name__ == "__main__":
    pipe = ContextVQAPipeline()
    pipe.forward()
```
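Once the pipeline finishes, each record nests a QA list inside `context_vqa`. A minimal post-processing sketch, assuming the JSON structure shown in the output example (`flatten_context_vqa` is a hypothetical helper, not part of DataFlow-MM), flattens it into one training row per QA pair:

```python
def flatten_context_vqa(records: list) -> list:
    """Flatten pipeline output into one row per QA pair, e.g. for
    training a retrieval-augmented VQA model."""
    rows = []
    for rec in records:
        ctx = rec["context_vqa"]
        for qa in ctx["qas"]:
            rows.append({
                "image": rec["image"][0],       # first image path of the record
                "context": ctx["context"],      # the generated article
                "question": qa["question"],
                "answer": qa["answer"],
            })
    return rows
```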
