FixPromptedVQAGenerator
2026-01-11
📘 Overview
FixPromptedVQAGenerator is a Fixed-Prompt Multimodal VQA Operator.
It is designed to execute the same instruction task on a batch of images or videos. Unlike dynamic templating operators, this operator accepts a static user_prompt (e.g., "Please caption this image") during initialization and applies it uniformly to every media sample in the input DataFrame.
Use Cases:
- Batch Image/Video Captioning.
- Uniform VQA queries across a dataset (e.g., "Is there any violence in this image?").
🏗️ __init__ Function
```python
def __init__(
    self,
    serving: LLMServingABC,
    system_prompt: str = "You are a helpful assistant.",
    user_prompt: str = "Please caption the media in detail."
):
```
🧾 Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| serving | LLMServingABC | N/A | The model serving instance for inference (must support multimodal inputs). |
| system_prompt | str | "You are a helpful assistant." | The system prompt sent to the model. |
| user_prompt | str | "Please caption the media in detail." | **Core Parameter.** The user instruction (prompt) applied uniformly to all input samples. |
⚡ run Function
```python
def run(
    self,
    storage: DataFlowStorage,
    input_image_key: str = "image",
    input_video_key: str = "video",
    output_answer_key: str = "answer",
):
    ...
```
Executes the main logic:
1. Read Data
   - Reads the DataFrame from `storage`.
2. Input Construction
   - Checks for and reads the `input_image_key` or `input_video_key` column.
   - Constructs the input message for each media file, combining the fixed `system_prompt`, the media file itself, and the fixed `user_prompt`.
3. Batch Inference
   - Packages the constructed prompts and media data into a batch.
   - Calls `serving.generate_from_input` to execute parallel inference.
4. Save Results
   - Writes the text generated by the model into the `output_answer_key` column.
   - Updates and saves the DataFrame.
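The input-construction step above can be sketched as follows. This is a minimal illustration, not the operator's actual internals: the helper name `build_inputs` and the chat-style message shape are assumptions, and the real format depends on the serving backend.

```python
import pandas as pd

def build_inputs(df, system_prompt, user_prompt,
                 input_image_key="image", input_video_key="video"):
    """Pair the fixed prompts with each row's media path.

    Hypothetical message format for illustration only.
    """
    # Prefer the image column if present, otherwise fall back to video
    if input_image_key in df.columns:
        media_key, media_type = input_image_key, "image"
    else:
        media_key, media_type = input_video_key, "video"

    batch = []
    for path in df[media_key]:
        # Same system/user prompts for every sample; only the media varies
        batch.append([
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": media_type, media_type: path},
                {"type": "text", "text": user_prompt},
            ]},
        ])
    return batch

df = pd.DataFrame({"image": ["/data/cat.jpg", "/data/dog.png"]})
inputs = build_inputs(
    df,
    system_prompt="You are a helpful assistant.",
    user_prompt="Please caption the media in detail.",
)
```

Each element of `inputs` would then be sent to the serving instance as one request in the batch.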
🧾 run Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| storage | DataFlowStorage | N/A | DataFlow storage object. |
| input_image_key | str | "image" | Column name for image paths (mutually exclusive with `input_video_key`). |
| input_video_key | str | "video" | Column name for video paths (mutually exclusive with `input_image_key`). |
| output_answer_key | str | "answer" | Column name for the generated output. |
🧩 Example Usage
```python
from dataflow.utils.storage import FileStorage
from dataflow.core import LLMServing
from dataflow.operators.generate import FixPromptedVQAGenerator

# 1) Initialize the model serving instance
serving = LLMServing(model_path="Qwen/Qwen2.5-VL-3B-Instruct")

# 2) Initialize the operator with a fixed prompt
#    Example: generate detailed descriptions for a batch of images
generator = FixPromptedVQAGenerator(
    serving=serving,
    system_prompt="You are a helpful visual assistant.",
    user_prompt="Describe the content of this image in detail, including objects, colors, and spatial relationships."
)

# 3) Prepare data
storage = FileStorage(
    file_name_prefix="image_captioning_task",
    cache_path="./cache_data"
)
storage.step()

# 4) Execute generation
generator.run(
    storage=storage,
    input_image_key="image_path",
    output_answer_key="detailed_caption"
)
```
🧾 Input/Output Example
Input DataFrame Rows:
| image_path |
|---|
| "/data/cat.jpg" |
| "/data/dog.png" |
Output DataFrame Rows:
| image_path | detailed_caption |
|---|---|
| "/data/cat.jpg" | "A black and white cat sitting on a sofa..." |
| "/data/dog.png" | "A golden retriever running on the grass..." |
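The column-level effect shown above can be mimicked in plain pandas. This is purely illustrative: the captions below are placeholders standing in for model output, not results from the operator.

```python
import pandas as pd

# Input DataFrame: one media path per row
df = pd.DataFrame({"image_path": ["/data/cat.jpg", "/data/dog.png"]})

# Placeholder captions standing in for serving.generate_from_input results,
# one per input row (same order as the DataFrame)
captions = [
    "A black and white cat sitting on a sofa...",
    "A golden retriever running on the grass...",
]

# The operator writes the generated text into the output_answer_key column
df["detailed_caption"] = captions
print(df)
```

The key point is that the operator appends one new column and leaves the existing rows and columns untouched.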

