Video QA Generation (VideoCaptionToQAGenerator)

2025-12-20

📘 Overview

VideoCaptionToQAGenerator is an operator that automatically generates question-answer pairs (Video QA) from video captions.
It builds a prompt from each input caption and guides the model to generate questions and answers about the video content. It is suited to video QA dataset construction, video understanding evaluation, and multimodal dialogue systems.

🏗️ `init` Function

def __init__(
    self,
    vlm_serving: VLMServingABC,
    prompt_template: Optional[VideoQAGeneratorPrompt | DiyVideoPrompt | str] = None,
    use_video_input: bool = True,
):
    ...

🧾 `init` Parameters

Parameter	Type	Default	Description
`vlm_serving`	`VLMServingABC`	-	VLM serving backend used to generate QA
`prompt_template`	`VideoQAGeneratorPrompt` | `DiyVideoPrompt` | `str` | `None`	`None`	Prompt template; when `None`, `VideoQAGeneratorPrompt` is used
`use_video_input`	`bool`	`True`	Whether to pass the video to the model as input (`False` generates QA from the caption text only)

⚡ `run` Function

def run(
    self,
    storage: DataFlowStorage,
    input_image_key: str = None,
    input_video_key: str = None,
    input_conversation_key: str = "conversation",
    input_caption_key: str = "caption",
    output_key: str = "answer",
) -> str:
    ...

`run` implements the main video QA generation logic: read the caption text → build the QA generation prompt → call the VLM → generate QA pairs → write them to the output field.

Returns: the `output_key` field name (a string).
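
The flow above can be sketched in plain Python. This is an illustration only, not the operator's actual implementation: `fake_vlm` and `generate_qa` are hypothetical stand-ins, the rows are plain dictionaries rather than a `DataFlowStorage` object, and the prompt text is the default format quoted in the Custom Prompts section.

```python
from typing import Callable, Dict, List

DEFAULT_PROMPT = (
    "Based on this caption: '{caption}', please generate relevant "
    "questions and answers about the video."
)


def fake_vlm(prompt: str) -> str:
    """Hypothetical stand-in for a VLM call; returns a canned QA string."""
    return "Q1: What is happening in the video?\nA1: See the caption."


def generate_qa(
    rows: List[Dict],
    model: Callable[[str], str],
    input_caption_key: str = "caption",
    input_conversation_key: str = "conversation",
    output_key: str = "answer",
) -> str:
    """Illustrative version of the read -> prompt -> model -> write flow."""
    for row in rows:
        # 1. Read the caption text.
        caption = row[input_caption_key]
        # 2. Build the QA generation prompt.
        prompt = DEFAULT_PROMPT.format(caption=caption)
        # 3. Record the prompt as the human turn of the conversation.
        row[input_conversation_key] = [{"from": "human", "value": prompt}]
        # 4. Call the model and write the generated QA text.
        row[output_key] = model(prompt)
    # Like `run`, return the name of the output field.
    return output_key


rows = [{"caption": "A person is walking in a park on a sunny day."}]
returned_key = generate_qa(rows, fake_vlm)
```

Returning the field name rather than the data mirrors the documented return value of `run`, which lets a later step look up the generated text by key.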

🧾 `run` Parameters

Parameter	Type	Default	Description
`storage`	`DataFlowStorage`	-	DataFlow storage object
`input_image_key`	`str`	`None`	Field name for images in input (optional)
`input_video_key`	`str`	`None`	Field name for videos in input (optional)
`input_conversation_key`	`str`	`"conversation"`	Field name for conversations in input
`input_caption_key`	`str`	`"caption"`	Field name for captions in input
`output_key`	`str`	`"answer"`	Field name for generated QA output

🧠 Example Usage

from dataflow.operators.core_vision import VideoCaptionToQAGenerator
from dataflow.serving import LocalModelVLMServing_vllm
from dataflow.utils.storage import FileStorage

# Step 1: Initialize VLM service
vlm_serving = LocalModelVLMServing_vllm(
    hf_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct",
    hf_cache_dir="./model_cache",
    vllm_tensor_parallel_size=1,
    vllm_temperature=0.7,
    vllm_top_p=0.9,
    vllm_max_tokens=2048,
    vllm_max_model_len=51200,
    vllm_gpu_memory_utilization=0.9
)

# Step 2: Prepare input data (must contain caption field)
storage = FileStorage(
    first_entry_file_name="./video_captions.json",
    cache_path="./cache",
    file_name_prefix="video_qa",
    cache_type="json",
)

# Step 3: Initialize and run operator
qa_generator = VideoCaptionToQAGenerator(
    vlm_serving=vlm_serving,
    use_video_input=True,  # Use video input
)
qa_generator.run(
    storage=storage.step(),
    input_video_key="video",
    input_conversation_key="conversation",
    input_caption_key="caption",
    output_key="answer"
)

🧾 Input Format Requirements

Field	Type	Description
`caption`	`str`	Video caption text (required)
`video`	`List[str]`	Video file path list (when using video input)
`image`	`List[str]`	Image file path list (optional)
`conversation`	`List[Dict]`	Conversation history (optional, auto-created/updated)
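
Checking records against this table before running the operator can catch malformed input early. The helper below is a minimal sketch; `validate_record` is a hypothetical function, not part of DataFlow, and it assumes `video` is needed only when video input is enabled.

```python
from typing import Any, Dict, List


def validate_record(record: Dict[str, Any], use_video_input: bool = True) -> List[str]:
    """Return a list of problems found in one input record (empty if none)."""
    problems = []
    caption = record.get("caption")
    if not isinstance(caption, str) or not caption.strip():
        problems.append("'caption' must be a non-empty string")
    # Path-list fields: 'video' is only expected when video input is used.
    for key in ("video", "image"):
        value = record.get(key)
        if value is None:
            if key == "video" and use_video_input:
                problems.append("'video' is missing but video input is enabled")
            continue
        if not isinstance(value, list) or not all(isinstance(p, str) for p in value):
            problems.append(f"'{key}' must be a list of path strings")
    conversation = record.get("conversation")
    if conversation is not None and not (
        isinstance(conversation, list)
        and all(isinstance(turn, dict) for turn in conversation)
    ):
        problems.append("'conversation' must be a list of dicts")
    return problems


good = {"caption": "A dog runs on a beach.", "video": ["./clip.mp4"]}
bad = {"caption": "", "video": "./clip.mp4"}  # empty caption, path not in a list
good_problems = validate_record(good)
bad_problems = validate_record(bad)
```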

📥 Example Input

{
  "caption": "A person is walking in a park on a sunny day. They are wearing casual clothes and appear to be enjoying the outdoor scenery.",
  "video": ["./test/example_video.mp4"],
  "conversation": [{"from": "human", "value": ""}]
}
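
An input file of such records can be written with the standard `json` module. This sketch assumes the file is a JSON array of records, which the source does not state explicitly, and it writes to a temporary directory only to stay self-contained.

```python
import json
import tempfile
from pathlib import Path

records = [
    {
        "caption": "A person is walking in a park on a sunny day.",
        "video": ["./test/example_video.mp4"],
        "conversation": [{"from": "human", "value": ""}],
    },
]

# A temporary directory keeps the sketch self-contained; in practice the
# file would be written to the path given to `first_entry_file_name`.
out_path = Path(tempfile.mkdtemp()) / "video_captions.json"
out_path.write_text(
    json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
)

# Read the file back to confirm it round-trips.
loaded = json.loads(out_path.read_text(encoding="utf-8"))
```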

📤 Example Output

{
  "caption": "A person is walking in a park on a sunny day. They are wearing casual clothes and appear to be enjoying the outdoor scenery.",
  "video": ["./test/example_video.mp4"],
  "conversation": [
    {
      "from": "human",
      "value": "Based on this caption: 'A person is walking in a park on a sunny day. They are wearing casual clothes and appear to be enjoying the outdoor scenery.', please generate relevant questions and answers about the video."
    }
  ],
  "answer": "Q1: What is the person doing in the video?\nA1: The person is walking in a park.\n\nQ2: What is the weather like in the video?\nA2: It is a sunny day.\n\nQ3: What is the person wearing?\nA3: The person is wearing casual clothes."
}
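
The `answer` field is free text, so downstream code usually has to split it into pairs. The parser below is a sketch that assumes the numbered `Qn:` / `An:` layout shown in this example; model output is not guaranteed to follow it, and `parse_qa` is a hypothetical helper, not part of DataFlow.

```python
import re
from typing import List, Tuple

# Match "Qn: question", a newline, then "An: answer" with the same number n.
QA_PATTERN = re.compile(
    r"Q(\d+):\s*(.+?)\s*\nA\1:\s*(.+?)\s*(?=\n\s*Q\d+:|\Z)",
    re.DOTALL,
)


def parse_qa(answer: str) -> List[Tuple[str, str]]:
    """Split numbered 'Qn: ... / An: ...' text into (question, answer) tuples."""
    return [(q.strip(), a.strip()) for _, q, a in QA_PATTERN.findall(answer)]


answer = (
    "Q1: What is the person doing in the video?\n"
    "A1: The person is walking in a park.\n\n"
    "Q2: What is the weather like in the video?\n"
    "A2: It is a sunny day.\n\n"
    "Q3: What is the person wearing?\n"
    "A3: The person is wearing casual clothes."
)
pairs = parse_qa(answer)
```

Text that does not match the pattern yields an empty list, so callers can treat that as a signal to inspect the raw output.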

🎨 Custom Prompts

Default prompt format:

Based on this caption: '{caption}', please generate relevant questions and answers about the video.

Method 1: Using a String

qa_generator = VideoCaptionToQAGenerator(
    vlm_serving=vlm_serving,
    prompt_template="Based on the following caption: '{caption}', please generate 3 QA pairs related to the video."
)

Method 2: Using a Custom Prompt Class

from dataflow.prompts.video import DiyVideoPrompt

custom_prompt = DiyVideoPrompt(
    "Caption: {caption}\n\nGenerate 5 QA pairs in the format:\nQ: ...\nA: ..."
)

qa_generator = VideoCaptionToQAGenerator(
    vlm_serving=vlm_serving,
    prompt_template=custom_prompt
)
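
Both methods rely on a `{caption}` placeholder in the template text. The check below is a sketch that assumes the template is filled with Python format-style substitution, which the placeholder syntax suggests but the source does not confirm; `placeholder_names` and `check_caption_template` are hypothetical helpers.

```python
from string import Formatter
from typing import Set


def placeholder_names(template: str) -> Set[str]:
    """Return the named `{...}` fields found in a format-style template."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name
    }


def check_caption_template(template: str) -> None:
    """Raise ValueError unless `caption` is the template's only named field."""
    names = placeholder_names(template)
    if names != {"caption"}:
        raise ValueError(f"expected only a 'caption' field, found {sorted(names)}")


template = "Caption: {caption}\n\nGenerate 5 QA pairs in the format:\nQ: ...\nA: ..."
check_caption_template(template)
# Preview the filled prompt before handing the template to the operator.
preview = template.format(caption="A dog runs on a beach.")
```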

🔄 Typical Workflow

from dataflow.operators.core_vision import (
    VideoToCaptionGenerator,     # Step 1: Generate video caption
    VideoCaptionToQAGenerator    # Step 2: Generate QA based on caption
)

# Reuses `vlm_serving` and `storage` from the Example Usage section
# Step 1: Generate caption for video
caption_generator = VideoToCaptionGenerator(vlm_serving=vlm_serving)
caption_generator.run(storage.step())

# Step 2: Generate QA based on caption
qa_generator = VideoCaptionToQAGenerator(
    vlm_serving=vlm_serving,
    use_video_input=True,  # True: use video and caption; False: caption only
)
qa_generator.run(storage.step())

🧾 Default Output Format

Field	Type	Description
`caption`	`str`	Input video caption
`video`	`List[str]`	Video file path list
`conversation`	`List[Dict]`	Updated conversation history
`answer`	`str`	Generated QA pairs text

Code: VideoCaptionToQAGenerator
Related Operators:
- VideoToCaptionGenerator - Video Caption Generation
- VideoMergedCaptionGenerator - Video Merged Caption Generation
