Video Caption Pipeline
2025-07-16
1. Overview
The Video Caption Generation Pipeline leverages Vision-Language Models (VLMs) to automatically generate high-quality descriptive text for videos. It is suitable for video annotation, multimodal dataset construction, and video understanding tasks.
We support the following use cases:
- Automatic video content annotation and caption generation
- Multimodal training dataset construction
- Video understanding and analysis tasks
The main stages of the pipeline include:
- Data Loading: Read video files and conversation format data.
- Video Understanding: Analyze video content using VLM models.
- Caption Generation: Generate detailed text descriptions based on video content.
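The three stages above can be sketched as plain functions. This is an illustrative toy, not DataFlow-MM code: `Record`, `load_data`, and `generate_caption` are hypothetical names, and a stub stands in for the VLM call.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    video: list                      # video file paths
    conversation: list               # conversation-format turns
    caption: str = ""                # filled in by the pipeline

def load_data(paths):
    """Stage 1: wrap raw video paths in conversation-format records."""
    return [Record(video=[p], conversation=[{"from": "human", "value": ""}])
            for p in paths]

def generate_caption(record):
    """Stages 2-3: the real pipeline sends the video to a VLM;
    here a stub stands in for the model call."""
    record.caption = f"A detailed description of {record.video[0]}"
    return record

records = [generate_caption(r) for r in load_data(["./videos/sample1.mp4"])]
print(records[0].caption)
```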
2. Quick Start
Step 1: Create a new DataFlow workspace

```bash
mkdir run_dataflow_mm
cd run_dataflow_mm
```

Step 2: Initialize DataFlow-MM

```bash
dataflowmm init
```

You will see:

```
run_dataflow_mm/playground/video_caption_pipeline.py
```

Step 3: Configure model path and Prompt
In video_caption_pipeline.py, configure the VLM model path and prompt template:
```python
# VLM model configuration
self.vlm_serving = LocalModelVLMServing_vllm(
    hf_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct",  # Modify to your model path
    hf_cache_dir='./dataflow_cache',
    vllm_tensor_parallel_size=1,
    vllm_temperature=0.7,
    vllm_top_p=0.9,
    vllm_max_tokens=2048,
    vllm_max_model_len=51200,
    vllm_gpu_memory_utilization=0.9
)

# Prompt Template configuration
self.prompt_template = VideoCaptionGeneratorPrompt()

# Operator initialization
self.prompted_vqa_generator = PromptedVQAGenerator(
    serving=self.vlm_serving,
    system_prompt="You are a helpful assistant.",
    prompt_template=self.prompt_template
)
```

Step 4: One-click run
```bash
python playground/video_caption_pipeline.py
```

API Version

If you prefer to use an API service instead of a local model, you can use the API version of the pipeline:

```bash
python api_pipelines/video_caption_api_pipeline.py
```

The API version works similarly to the local version; you just need to configure the API Key and service URL. See api_pipelines/video_caption_api_pipeline.py for configuration details.
You can also run any other pipeline script as needed; the process is similar. Below we introduce the PromptedVQAGenerator operator used in the pipeline and how to configure it.
3. Data Flow and Pipeline Logic
1. Input Data
The pipeline input includes the following fields:
- video: List of video file paths, e.g., `["path/to/video.mp4"]`
- conversation: Conversation format data, e.g., `[{"from": "human", "value": ""}]`
- image (optional): List of image file paths, for processing images simultaneously
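A minimal input file matching this schema can also be generated programmatically. The sketch below writes two records to a hypothetical `sample_data.json`; the empty `"value"` field is later filled by the prompt template.

```python
import json

# One record per video, in the conversation format expected by the pipeline.
records = [
    {"video": [f"./videos/sample{i}.mp4"],
     "conversation": [{"from": "human", "value": ""}]}
    for i in (1, 2)
]

with open("sample_data.json", "w") as f:
    json.dump(records, f, indent=2)
```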
Inputs can be stored in files (such as JSON or JSONL) and are managed and read via the FileStorage object. The provided example loads the default data path; in practice, you can modify first_entry_file_name and cache_path to point at your own data and cache locations:
```python
self.storage = FileStorage(
    first_entry_file_name="./dataflow/example/video_caption/sample_data.json",
    cache_path="./cache",
    file_name_prefix="video_caption",
    cache_type="json",
)
```

Input Data Example:
```json
[
    {
        "video": ["./videos/sample1.mp4"],
        "conversation": [{"from": "human", "value": ""}]
    },
    {
        "video": ["./videos/sample2.mp4"],
        "conversation": [{"from": "human", "value": ""}]
    }
]
```

2. Video Caption Generation (PromptedVQAGenerator)
The core step of the pipeline uses the PromptedVQAGenerator operator together with VideoCaptionGeneratorPrompt to generate a detailed text description for each video.
Functionality:
- Analyze video content using VLM models and generate descriptive text
- Use predefined prompt templates to guide the model in generating high-quality captions
- Support custom system prompts and description styles
- Configurable generation parameters (temperature, top_p, etc.)
Input: Video file paths and conversation format data
Output: Generated video caption text
Model Service Configuration:
```python
self.vlm_serving = LocalModelVLMServing_vllm(
    hf_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct",
    hf_cache_dir='./dataflow_cache',
    vllm_tensor_parallel_size=1,      # Set to 1 for a single GPU, or to the number of GPUs
    vllm_temperature=0.7,             # Generation temperature, controls randomness
    vllm_top_p=0.9,                   # Top-p sampling parameter
    vllm_max_tokens=2048,             # Maximum generated tokens
    vllm_max_model_len=51200,         # Maximum model context length
    vllm_gpu_memory_utilization=0.9   # GPU memory utilization
)
```

Prompt Template Configuration:

```python
self.prompt_template = VideoCaptionGeneratorPrompt()
```

Operator Initialization:
```python
self.prompted_vqa_generator = PromptedVQAGenerator(
    serving=self.vlm_serving,
    system_prompt="You are a helpful assistant.",  # System prompt
    prompt_template=self.prompt_template           # Prompt template for video caption generation
)
```

Operator Run:
```python
self.prompted_vqa_generator.run(
    storage=self.storage.step(),
    input_image_key="image",                # Input image field (optional)
    input_video_key="video",                # Input video field
    input_conversation_key="conversation",  # Input conversation field
    output_answer_key="caption",            # Output caption field
)
```

3. Output Data
The final output includes:
- video: Original video path
- conversation: Updated conversation data (including generated prompts)
- caption: Generated video caption text
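Because the output is plain JSON, it can be post-processed with ordinary Python. The sketch below works on an in-memory record list for illustration; in practice you would load the cache file written by FileStorage.

```python
# One output record, in the shape produced by the pipeline.
records = [
    {"video": ["./videos/sample1.mp4"],
     "conversation": [{"from": "human", "value": "Please describe the video in detail."}],
     "caption": "A person walking in a park on a sunny day."},
]

# Map each video path to its generated caption, skipping records without one.
captions = {r["video"][0]: r["caption"] for r in records if r.get("caption")}
print(captions)
```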
Output Data Example:
```json
{
    "video": ["./videos/sample1.mp4"],
    "conversation": [{"from": "human", "value": "Please describe the video in detail."}],
    "caption": "This video shows a person walking in a park on a sunny day. The weather is clear and bright, with trees and benches visible in the background. The person is wearing casual clothes and walking at a leisurely pace, seemingly enjoying a peaceful outdoor moment."
}
```

4. Pipeline Example
An example pipeline demonstrating how to use PromptedVQAGenerator combined with VideoCaptionGeneratorPrompt for video caption generation:
```python
from dataflow.operators.core_vision import PromptedVQAGenerator
from dataflow.serving import LocalModelVLMServing_vllm
from dataflow.utils.storage import FileStorage
from dataflow.prompts.video import VideoCaptionGeneratorPrompt


class VideoCaptionGenerator:
    def __init__(self):
        # -------- Storage Configuration --------
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/video_caption/sample_data.json",
            cache_path="./cache",
            file_name_prefix="video_caption",
            cache_type="json",
        )
        # -------- VLM Model Service --------
        self.vlm_serving = LocalModelVLMServing_vllm(
            hf_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct",
            hf_cache_dir='./dataflow_cache',
            vllm_tensor_parallel_size=1,
            vllm_temperature=0.7,
            vllm_top_p=0.9,
            vllm_max_tokens=2048,
            vllm_max_model_len=51200,
            vllm_gpu_memory_utilization=0.9
        )
        # -------- Prompt Template Configuration --------
        self.prompt_template = VideoCaptionGeneratorPrompt()
        # -------- Video Caption Generator Operator --------
        self.prompted_vqa_generator = PromptedVQAGenerator(
            serving=self.vlm_serving,
            system_prompt="You are a helpful assistant.",
            prompt_template=self.prompt_template
        )

    def forward(self):
        # Call PromptedVQAGenerator to generate captions
        self.prompted_vqa_generator.run(
            storage=self.storage.step(),
            input_image_key="image",
            input_video_key="video",
            input_conversation_key="conversation",
            output_answer_key="caption",
        )


if __name__ == "__main__":
    # Pipeline entry point
    model = VideoCaptionGenerator()
    model.forward()
```
