Image VQA Generation Pipeline (API Version)
2026-02-10
1. Overview
Image VQA Generation Pipeline (API Version) focuses on automatically constructing high-quality Question-Answer (QA) Pairs directly from image content. Leveraging high-performance VLM APIs, this pipeline generates human-like questions and accurate answers based on the visual features of an image. This is highly valuable for training multimodal dialogue models, evaluating visual understanding capabilities, and building industry-specific VQA datasets (e.g., medical, security, e-commerce).
We support the following application scenarios:
- Instruction Fine-tuning Data Synthesis: Generate diverse questioning styles to enhance model interaction capabilities.
- Visual Understanding Evaluation: Produce judgment, descriptive, or reasoning-based QAs targeting specific image details.
- Automated Annotation: Replace manual labor for large-scale image QA annotation, reducing data production costs.
2. Quick Start
Step 1: Configure API Key
Make sure your API key is available as an environment variable:
```python
import os
os.environ["DF_API_KEY"] = "sk-your-key-here"
```

Step 2: Initialize Environment
```bash
# Create and enter the workspace
mkdir run_vqa_dataflow
cd run_vqa_dataflow

# Initialize DataFlow-MM configuration
dataflowmm init
```

Step 3: Download Example Data
```bash
huggingface-cli download --repo-type dataset OpenDCAI/dataflow-demo-image --local-dir example_data
```

Step 4: Configure Running Script
In api_pipelines/image_vqa.py, you can customize the VLM model name and API information:
```python
self.vlm_serving = APIVLMServing_openai(
    api_url="http://172.96.141.132:3001/v1",  # Supports any OpenAI-compatible interface
    key_name_of_api_key="DF_API_KEY",
    model_name="gpt-5-nano-2025-08-07",
    max_workers=10
)
```

Step 5: Execute the Pipeline
```bash
python api_pipelines/image_vqa.py
```

3. Data Flow and Logic Description
1. Input Data Format
The input file must contain the image path and a prompt to guide the VQA generation:
```json
[
  {
    "image": ["./example_data/image_vqa/person.png"],
    "conversation": [
      {
        "from": "human",
        "value": "Please generate a relevant question based on the content of the picture, and only output the question content."
      }
    ]
  }
]
```

2. Core Operator: PromptedVQAGenerator
This operator serves as the engine for generating QA pairs:
- Role Definition: Through the `system_prompt`, the model is set up as an "image question-answer generator," guiding it to output standard QA formats.
- Multi-turn Support: It can combine historical context or specific instructions in the `conversation` field to refine the focus of question generation.
- High-Throughput Processing: Uses `max_workers` to issue parallel API calls, making it suitable for processing tens of thousands of images or more.
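The `max_workers`-style parallelism can be pictured with a standard thread pool. The sketch below is an illustration of the pattern, not the DataFlow-MM implementation; `call_vlm` is a hypothetical stand-in for a real OpenAI-compatible request:

```python
from concurrent.futures import ThreadPoolExecutor

def call_vlm(image_path: str) -> str:
    # Stand-in for a real OpenAI-compatible chat request;
    # in practice this would POST the image and prompt to the API.
    return f"Q/A for {image_path}"

image_paths = [f"./img_{i}.png" for i in range(100)]

# max_workers bounds the number of concurrent in-flight requests,
# trading throughput against the provider's rate limits.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(call_vlm, image_paths))

print(len(results))  # one result per input image
```

Because API calls are I/O-bound, a thread pool of 10 workers typically yields close to a 10x speedup over sequential requests, up to the provider's rate limit.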
3. Output Result Example
The generated question and answer are written back into each record, by default under the `question` and `answer` fields:
```json
[
  {
    "image": ["./example_data/image_vqa/person.png"],
    "conversation": [
      {
        "from": "human",
        "value": "Please generate a relevant question based on the content of the picture, and only output the question content."
      }
    ],
    "question": "Who is the main actor in the movie \"Nightmare Alley\"?",
    "answer": "The main actor in the movie \"Nightmare Alley\" is Bradley Cooper."
  }
]
```

4. Complete Pipeline Code
```python
import os

# Set API Key environment variable
os.environ["DF_API_KEY"] = "sk-xxx"

from dataflow.utils.storage import FileStorage
from dataflow.core import LLMServingABC
from dataflow.serving.api_vlm_serving_openai import APIVLMServing_openai
from dataflow.operators.core_vision import PromptedVQAGenerator


class ImageVQAPipeline:
    """
    Generate batch VQA for images with a single command.
    """

    def __init__(self, llm_serving: LLMServingABC = None):
        # ---------- 1. Storage ----------
        self.storage = FileStorage(
            first_entry_file_name="./example_data/image_vqa/sample_data.json",
            cache_path="./cache_local",
            file_name_prefix="qa_api",
            cache_type="json",
        )
        # ---------- 2. Serving ----------
        self.vlm_serving = APIVLMServing_openai(
            api_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # Any API platform compatible with OpenAI format
            key_name_of_api_key="DF_API_KEY",  # Set the corresponding platform's API key in the environment variable above
            model_name="qwen3-vl-8b-instruct",
            image_io=None,
            send_request_stream=False,
            max_workers=10,
            timeout=1800,
        )
        # ---------- 3. Operator ----------
        self.vqa_generator = PromptedVQAGenerator(
            serving=self.vlm_serving,
            system_prompt="You are an image question-answer generator. Your task is to generate a question-answer pair for the given image content."
        )

    # ------------------------------------------------------------------ #
    def forward(self):
        input_image_key = "image"
        output_step1_key = "question"
        output_step2_key = "answer"

        # Step 1: Generate the question for the image
        self.vqa_generator.run(
            storage=self.storage.step(),
            input_conversation_key="conversation",
            input_image_key=input_image_key,
            output_answer_key=output_step1_key,
        )
        # Step 2: Generate the answer for the question
        self.vqa_generator.run(
            storage=self.storage.step(),
            input_prompt_key=output_step1_key,
            input_image_key=input_image_key,
            output_answer_key=output_step2_key,
        )


# ---------------------------- CLI entry point ---------------------------- #
if __name__ == "__main__":
    pipe = ImageVQAPipeline()
    pipe.forward()
```
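Once the pipeline has run, the cached records can be converted into plain instruction-tuning pairs. The sketch below uses an in-memory record shaped like the output example above; in practice you would load the JSON file that `FileStorage` writes under `./cache_local` (the exact file name depends on the `file_name_prefix` and step):

```python
import json

def to_sft_pairs(records):
    """Convert pipeline output records into simple instruction-tuning pairs."""
    pairs = []
    for rec in records:
        # Keep only records where both generation steps succeeded.
        if "question" in rec and "answer" in rec:
            pairs.append({
                "image": rec["image"],
                "instruction": rec["question"],
                "output": rec["answer"],
            })
    return pairs

# In-memory sample; replace with json.load() on the actual cache file.
records = [{
    "image": ["./example_data/image_vqa/person.png"],
    "question": "Who is the main actor in the movie \"Nightmare Alley\"?",
    "answer": "The main actor in the movie \"Nightmare Alley\" is Bradley Cooper.",
}]

pairs = to_sft_pairs(records)
print(json.dumps(pairs, ensure_ascii=False, indent=2))
```

Filtering on the presence of both keys also acts as a cheap sanity check, dropping rows where an API call failed at either step.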
