Case 5. Visual Question Answering (VQA)


2025-07-16

The VQA operator takes a user-defined text prompt and a JSON file in which each record provides an image path. It sends each image together with the text prompt to the model for inference. In a few simple steps, you can complete tasks such as OCR, visual question answering, and image information extraction.

Step 1: Install Dataflow

pip install open-dataflow

Step 2: Create a new dataflow working directory

mkdir run_dataflow
cd run_dataflow

Step 3: Initialize Dataflow

dataflow init

After initialization you will see:

run_dataflow/playground/vqa.py

Step 4: Configure API Key

If you are using the OpenAI API, you need to set an environment variable first:

Linux/Mac:

export DF_API_KEY="sk-xxxxxx"

Windows PowerShell:

$env:DF_API_KEY = "sk-xxxxxx"
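
Before running the pipeline, it can help to confirm that the variable is visible to Python. The following is a minimal sketch, not part of Dataflow; `require_api_key` is a hypothetical helper name, and only the variable name DF_API_KEY comes from this guide.

```python
import os


def require_api_key(env_name="DF_API_KEY"):
    """Return the API key stored in env_name, or raise if it is missing or blank.

    Hypothetical helper for illustration; Dataflow does not provide it.
    """
    key = os.environ.get(env_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {env_name} is not set; export it before running vqa.py"
        )
    return key
```

Calling `require_api_key()` at the top of your script fails early with a clear message instead of failing later inside an API request.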

Step 5: Prepare Image Path Data

In the project root directory, create a JSON file, for example pic_path.json, with the following format:

[
  {"raw_content": "/absolute/path/to/image1.jpg"},
  {"raw_content": "/absolute/path/to/image2.png"}
]

The operator reads the image path from the field whose name you pass in and sends that image to the model. The default field name is raw_content, but you can change it in the invocation script. Images must be in JPG (JPEG) or PNG format.
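
If you have a folder of images, you can generate a file in this format rather than writing it by hand. The sketch below is one possible approach and is not part of Dataflow: the helper names `build_image_records` and `write_image_records` are hypothetical, and it assumes all images sit directly in one directory.

```python
import json
from pathlib import Path

# Only the formats this guide lists as supported.
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def build_image_records(image_dir, field_name="raw_content"):
    """Return one record per JPG/PNG file in image_dir, using absolute paths."""
    root = Path(image_dir).resolve()
    return [
        {field_name: str(path)}
        for path in sorted(root.iterdir())
        if path.is_file() and path.suffix.lower() in ALLOWED_SUFFIXES
    ]


def write_image_records(image_dir, output_file="pic_path.json", field_name="raw_content"):
    """Write the records as a JSON array and return them."""
    records = build_image_records(image_dir, field_name)
    Path(output_file).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records
```

For example, `write_image_records("/data/photos")` would create pic_path.json in the current directory; the directory path here is only a placeholder.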

Step 6: Modify and Use the VQA Script as Needed

Refer to the example code below:

from dataflow.operators.generate.Vqa.PromptedVQAGenerator import PromptedVQAGenerator
from dataflow.serving.APIVLMServing_openai import APIVLMServing_openai
from dataflow.utils.storage import FileStorage

class Vqa_generator:
    def __init__(self):
        # Custom prompt; can be changed to OCR, information extraction, etc., as needed
        self.prompt = "Describe the image in detail."

        # Specify input file and cache directory
        self.storage = FileStorage(
            first_entry_file_name="pic_path.json",
            cache_path="./cache",
            file_name_prefix="vqa",
            cache_type="json",
        )

        # Call the OpenAI API
        self.llm_serving = APIVLMServing_openai(
            model_name="o4-mini",
            api_url="https://api.openai.com/v1",
            key_name_of_api_key="DF_API_KEY",
        )

        # Build the VQA operator
        self.vqa_generate = PromptedVQAGenerator(
            self.llm_serving,
            self.prompt
        )

    def forward(self):
        self.vqa_generate.run(
            storage=self.storage.step(),
            input_key="raw_content"
        )

if __name__ == "__main__":
    Vqa_generator().forward()

Parameter Notes:

self.prompt: Guides the model for description, Q&A, OCR, etc.
storage.step(): Reads each raw_content entry from pic_path.json in sequence.

The results will be written to files like ./cache/vqa-0.json, vqa-1.json, etc., with the following format:

{
  "raw_content": "/absolute/path/to/image1.jpg",
  "result": "A close-up photo of a red apple on a wooden table."
}
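
To inspect the output programmatically, you can load a cache file and map each image path to its answer. This is a hedged sketch rather than a Dataflow API: `load_vqa_results` is a hypothetical helper, the key names are taken from the example above, and because the exact file layout is an assumption, it accepts either a single record or a list of records.

```python
import json
from pathlib import Path


def load_vqa_results(cache_file, input_key="raw_content", output_key="result"):
    """Map each image path to the model's answer from one cache file."""
    data = json.loads(Path(cache_file).read_text(encoding="utf-8"))
    # Assumption: the file holds either one record or a list of records.
    records = data if isinstance(data, list) else [data]
    return {record[input_key]: record.get(output_key) for record in records}
```

Records without a result field map to None, which makes failed or skipped images easy to spot.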

We provide an example file for running this operator at dataflow/example/Vqa/pic_path.json. Fill in your own API URL and API key to try the VQA operator right away.

Operator Logic Description

This operator is implemented based on APIVLMServing_openai in dataflow/serving/APIVLMServing_openai.py, and primarily provides the basic functionality for OpenAI‐style image question answering, with a built‐in concurrent invocation mechanism. Its workflow is as follows:

The server encodes the input image into a Base64 string via the _encode_image_to_base64 method;
The user‐provided text prompt and the aforementioned Base64 image are concatenated into a complete message body according to the model interface specification;
This message body is then passed to the OpenAI‐style model for processing.

Instruction format example:

# fmt is the image format ("png" or "jpg"); b64 is the Base64-encoded image
content = [
    {"type": "text",      "text": text_prompt},
    {"type": "image_url", "image_url": {"url": f"data:image/{fmt};base64,{b64}"}}
]
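
The encoding and message-building steps can be sketched as a standalone example. This is an illustration of the workflow described above, not the code in APIVLMServing_openai.py: `encode_image` and `build_vqa_content` are hypothetical names, and mapping .jpg files to the jpeg format label (the standard image/jpeg MIME type) is a choice made here rather than something the source confirms.

```python
import base64
from pathlib import Path

# Assumption: .jpg and .jpeg both use the standard "jpeg" MIME subtype.
FORMAT_BY_SUFFIX = {".jpg": "jpeg", ".jpeg": "jpeg", ".png": "png"}


def encode_image(image_path):
    """Return (base64_string, format_label) for a JPG or PNG file."""
    path = Path(image_path)
    suffix = path.suffix.lower()
    if suffix not in FORMAT_BY_SUFFIX:
        raise ValueError(f"Unsupported image format: {suffix or '(no extension)'}")
    b64 = base64.b64encode(path.read_bytes()).decode("ascii")
    return b64, FORMAT_BY_SUFFIX[suffix]


def build_vqa_content(text_prompt, image_path):
    """Combine the prompt and the image into an OpenAI-style content list."""
    b64, fmt = encode_image(image_path)
    return [
        {"type": "text", "text": text_prompt},
        {"type": "image_url", "image_url": {"url": f"data:image/{fmt};base64,{b64}"}},
    ]
```

The returned list is what would go in the content field of a user message sent to an OpenAI-style chat endpoint.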

Example Prompts

OCR text recognition:
self.prompt = "Please recognize and output all text present in the image."
Visual question answering:
self.prompt = "What is the person doing in this image?"
Object detection and attribute extraction:
self.prompt = "Extract main objects and their attributes from the image."

Operator source code location:
/dataflow/statics/playground/playground/vqa.py

By customizing the prompt, you can quickly reuse this workflow to accomplish various vision-instruction tasks!