Image VQA Generation Pipeline (API Version)
2026-02-10
1. Overview
Image VQA Generation Pipeline (API Version) focuses on automatically constructing high-quality Question-Answer (QA) Pairs directly from image content. Leveraging high-performance VLM APIs, this pipeline generates human-like questions and accurate answers based on the visual features of an image. This is highly valuable for training multimodal dialogue models, evaluating visual understanding capabilities, and building industry-specific VQA datasets (e.g., medical, security, e-commerce).
We support the following application scenarios:
- Instruction Fine-tuning Data Synthesis: Generate diverse questioning styles to enhance model interaction capabilities.
- Visual Understanding Evaluation: Produce judgment, descriptive, or reasoning-based QAs targeting specific image details.
- Automated Annotation: Replace manual labor for large-scale image QA annotation, reducing data production costs.
2. Quick Start
Step 1: Configure API Key
Make sure your API key is available as an environment variable:
```python
import os
os.environ["DF_API_KEY"] = "sk-your-key-here"
```

Step 2: Initialize Environment
```bash
# Create and enter the workspace
mkdir run_vqa_dataflow
cd run_vqa_dataflow

# Initialize DataFlow-MM configuration
dataflowmm init
```

Step 3: Download Example Data
```bash
huggingface-cli download --repo-type dataset OpenDCAI/dataflow-demo-image --local-dir example_data
```

Step 4: Configure Running Script
In api_pipelines/image_vqa.py, you can customize the VLM model name and API information:
```python
self.vlm_serving = APIVLMServing_openai(
    api_url="http://172.96.141.132:3001/v1",  # Supports any OpenAI-compatible interface
    key_name_of_api_key="DF_API_KEY",
    model_name="gpt-5-nano-2025-08-07",
    max_workers=10
)
```

Step 5: Execute the Pipeline
```bash
python api_pipelines/image_vqa.py
```

3. Data Flow and Logic Description
1. Input Data Format
The input file must contain the image path and a prompt to guide the VQA generation:
```json
[
  {
    "image": ["./example_data/image_vqa/person.png"],
    "conversation": [
      {
        "from": "human",
        "value": "Please generate a relevant question based on the content of the picture, and only output the question content."
      }
    ]
  }
]
```

2. Core Operator: PromptedVQAGenerator
This operator serves as the engine for generating QA pairs:
- Role Definition: Through the `system_prompt`, the model is set up as an "image question-answer generator," guiding it to output standard QA formats.
- Multi-turn Support: It can combine historical context or specific instructions in the `conversation` field to refine the focus of question generation.
- High-Throughput Processing: Uses `max_workers` to issue parallel API calls, making it suitable for processing tens of thousands of images or more.
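The `max_workers`-style parallelism can be pictured with a standard thread pool. The sketch below is an illustration of the pattern, not the DataFlow-MM implementation; `call_vlm` is a hypothetical stand-in for a real OpenAI-compatible request:

```python
from concurrent.futures import ThreadPoolExecutor

def call_vlm(image_path: str) -> str:
    # Stand-in for a real OpenAI-compatible chat request;
    # in practice this would POST the image and prompt to the API.
    return f"Q/A for {image_path}"

image_paths = [f"./img_{i}.png" for i in range(100)]

# max_workers bounds the number of concurrent in-flight requests,
# trading throughput against the provider's rate limits.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(call_vlm, image_paths))

print(len(results))  # one result per input image
```

Because API calls are I/O-bound, a thread pool of 10 workers typically yields close to a 10x speedup over sequential requests, up to the provider's rate limit.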
3. Output Result Example
The generated question and answer are written back into each record, by default under the `question` and `answer` fields:
```json
[
  {
    "image": ["./example_data/image_vqa/person.png"],
    "conversation": [
      {
        "from": "human",
        "value": "Please generate a relevant question based on the content of the picture, and only output the question content."
      }
    ],
    "question": "Who is the main actor in the movie \"Nightmare Alley\"?",
    "answer": "The main actor in the movie \"Nightmare Alley\" is Bradley Cooper."
  }
]
```

4. Complete Pipeline Code
```python
import os

# Set API Key environment variable
os.environ["DF_API_KEY"] = "sk-xxx"

from dataflow.utils.storage import FileStorage
from dataflow.core import LLMServingABC
from dataflow.serving.api_vlm_serving_openai import APIVLMServing_openai
from dataflow.operators.core_vision import PromptedVQAGenerator


class ImageVQAPipeline:
    """
    Generate batch VQA for images with a single command.
    """

    def __init__(self, llm_serving: LLMServingABC = None):
        # ---------- 1. Storage ----------
        self.storage = FileStorage(
            first_entry_file_name="./example_data/image_vqa/sample_data.json",
            cache_path="./cache_local",
            file_name_prefix="qa_api",
            cache_type="json",
        )
        # ---------- 2. Serving ----------
        self.vlm_serving = APIVLMServing_openai(
            api_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # Any API platform compatible with OpenAI format
            key_name_of_api_key="DF_API_KEY",  # Set the corresponding platform's API key in the environment variable above
            model_name="qwen3-vl-8b-instruct",
            image_io=None,
            send_request_stream=False,
            max_workers=10,
            timeout=1800,
        )
        # ---------- 3. Operator ----------
        self.vqa_generator = PromptedVQAGenerator(
            serving=self.vlm_serving,
            system_prompt="You are an image question-answer generator. Your task is to generate a question-answer pair for the given image content."
        )

    # ------------------------------------------------------------------ #
    def forward(self):
        input_image_key = "image"
        output_step1_key = "question"
        output_step2_key = "answer"

        # Step 1: Generate the question for the image
        self.vqa_generator.run(
            storage=self.storage.step(),
            input_conversation_key="conversation",
            input_image_key=input_image_key,
            output_answer_key=output_step1_key,
        )
        # Step 2: Generate the answer for the question
        self.vqa_generator.run(
            storage=self.storage.step(),
            input_prompt_key=output_step1_key,
            input_image_key=input_image_key,
            output_answer_key=output_step2_key,
        )


# ---------------------------- CLI entry point ---------------------------- #
if __name__ == "__main__":
    pipe = ImageVQAPipeline()
    pipe.forward()
```
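Once the pipeline has run, the cached records can be converted into plain instruction-tuning pairs. The sketch below uses an in-memory record shaped like the output example above; in practice you would load the JSON file that `FileStorage` writes under `./cache_local` (the exact file name depends on the `file_name_prefix` and step):

```python
import json

def to_sft_pairs(records):
    """Convert pipeline output records into simple instruction-tuning pairs."""
    pairs = []
    for rec in records:
        # Keep only records where both generation steps succeeded.
        if "question" in rec and "answer" in rec:
            pairs.append({
                "image": rec["image"],
                "instruction": rec["question"],
                "output": rec["answer"],
            })
    return pairs

# In-memory sample; replace with json.load() on the actual cache file.
records = [{
    "image": ["./example_data/image_vqa/person.png"],
    "question": "Who is the main actor in the movie \"Nightmare Alley\"?",
    "answer": "The main actor in the movie \"Nightmare Alley\" is Bradley Cooper.",
}]

pairs = to_sft_pairs(records)
print(json.dumps(pairs, ensure_ascii=False, indent=2))
```

Filtering on the presence of both keys also acts as a cheap sanity check, dropping rows where an API call failed at either step.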
