Video Caption Pipeline
2025-07-16
1. Overview
The Video Caption Generation Pipeline leverages Vision-Language Models (VLMs) to automatically generate high-quality descriptive text for videos. It is suitable for video annotation, multimodal dataset construction, and video understanding tasks.
We support the following use cases:
- Automatic video content annotation and caption generation
- Multimodal training dataset construction
- Video understanding and analysis tasks
The main stages of the pipeline include:
- Data Loading: Read video files and conversation format data.
- Video Understanding: Analyze video content using VLM models.
- Caption Generation: Generate detailed text descriptions based on video content.
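The three stages above can be sketched as plain functions. This is an illustrative toy, not DataFlow-MM code: `Record`, `load_data`, and `generate_caption` are hypothetical names, and a stub stands in for the VLM call.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    video: list                      # video file paths
    conversation: list               # conversation-format turns
    caption: str = ""                # filled in by the pipeline

def load_data(paths):
    """Stage 1: wrap raw video paths in conversation-format records."""
    return [Record(video=[p], conversation=[{"from": "human", "value": ""}])
            for p in paths]

def generate_caption(record):
    """Stages 2-3: the real pipeline sends the video to a VLM;
    here a stub stands in for the model call."""
    record.caption = f"A detailed description of {record.video[0]}"
    return record

records = [generate_caption(r) for r in load_data(["./videos/sample1.mp4"])]
print(records[0].caption)
```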
2. Quick Start
Step 1: Create a new DataFlow workspace

```bash
mkdir run_dataflow_mm
cd run_dataflow_mm
```

Step 2: Initialize DataFlow-MM

```bash
dataflowmm init
```

You will see:

```
run_dataflow_mm/playground/video_caption_pipeline.py
```

Step 3: Configure model path and Prompt
In video_caption_pipeline.py, configure the VLM model path and prompt template:
```python
# VLM model configuration
self.vlm_serving = LocalModelVLMServing_vllm(
    hf_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct",  # Modify to your model path
    hf_cache_dir='./dataflow_cache',
    vllm_tensor_parallel_size=1,
    vllm_temperature=0.7,
    vllm_top_p=0.9,
    vllm_max_tokens=2048,
    vllm_max_model_len=51200,
    vllm_gpu_memory_utilization=0.9
)

# Prompt Template configuration
self.prompt_template = VideoCaptionGeneratorPrompt()

# Operator initialization
self.prompted_vqa_generator = PromptedVQAGenerator(
    serving=self.vlm_serving,
    system_prompt="You are a helpful assistant.",
    prompt_template=self.prompt_template
)
```

Step 4: One-click run
```bash
python playground/video_caption_pipeline.py
```

API Version

If you prefer to use an API service instead of a local model, you can use the API version of the pipeline:

```bash
python api_pipelines/video_caption_api_pipeline.py
```

The API version works similarly to the local version; you just need to configure the API Key and service URL. See api_pipelines/video_caption_api_pipeline.py for configuration details.
You can also run any other pipeline script as needed; the process is similar. Below we introduce the PromptedVQAGenerator operator used in the pipeline and how to configure it.
3. Data Flow and Pipeline Logic
1. Input Data
The pipeline input includes the following fields:
- video: List of video file paths, e.g., `["path/to/video.mp4"]`
- conversation: Conversation format data, e.g., `[{"from": "human", "value": ""}]`
- image (optional): List of image file paths, for processing images simultaneously
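A minimal input file matching this schema can also be generated programmatically. The sketch below writes two records to a hypothetical `sample_data.json`; the empty `"value"` field is later filled by the prompt template.

```python
import json

# One record per video, in the conversation format expected by the pipeline.
records = [
    {"video": [f"./videos/sample{i}.mp4"],
     "conversation": [{"from": "human", "value": ""}]}
    for i in (1, 2)
]

with open("sample_data.json", "w") as f:
    json.dump(records, f, indent=2)
```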
Inputs can be stored in files (such as JSON or JSONL) and are managed and read via the FileStorage object. The provided example loads the default data path; in practice, you can modify first_entry_file_name and cache_path to point at your own data and cache locations:
```python
self.storage = FileStorage(
    first_entry_file_name="./dataflow/example/video_caption/sample_data.json",
    cache_path="./cache",
    file_name_prefix="video_caption",
    cache_type="json",
)
```

Input Data Example:
```json
[
    {
        "video": ["./videos/sample1.mp4"],
        "conversation": [{"from": "human", "value": ""}]
    },
    {
        "video": ["./videos/sample2.mp4"],
        "conversation": [{"from": "human", "value": ""}]
    }
]
```

2. Video Caption Generation (PromptedVQAGenerator)
The core step of the pipeline uses the PromptedVQAGenerator operator together with VideoCaptionGeneratorPrompt to generate a detailed text description for each video.
Functionality:
- Analyze video content using VLM models and generate descriptive text
- Use predefined prompt templates to guide the model in generating high-quality captions
- Support custom system prompts and description styles
- Configurable generation parameters (temperature, top_p, etc.)
Input: Video file paths and conversation format data
Output: Generated video caption text
Model Service Configuration:
```python
self.vlm_serving = LocalModelVLMServing_vllm(
    hf_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct",
    hf_cache_dir='./dataflow_cache',
    vllm_tensor_parallel_size=1,      # Set to 1 for a single GPU, or to the number of GPUs
    vllm_temperature=0.7,             # Generation temperature, controls randomness
    vllm_top_p=0.9,                   # Top-p sampling parameter
    vllm_max_tokens=2048,             # Maximum generated tokens
    vllm_max_model_len=51200,         # Maximum model context length
    vllm_gpu_memory_utilization=0.9   # GPU memory utilization
)
```

Prompt Template Configuration:

```python
self.prompt_template = VideoCaptionGeneratorPrompt()
```

Operator Initialization:
```python
self.prompted_vqa_generator = PromptedVQAGenerator(
    serving=self.vlm_serving,
    system_prompt="You are a helpful assistant.",  # System prompt
    prompt_template=self.prompt_template           # Prompt template for video caption generation
)
```

Operator Run:
```python
self.prompted_vqa_generator.run(
    storage=self.storage.step(),
    input_image_key="image",                # Input image field (optional)
    input_video_key="video",                # Input video field
    input_conversation_key="conversation",  # Input conversation field
    output_answer_key="caption",            # Output caption field
)
```

3. Output Data
The final output includes:
- video: Original video path
- conversation: Updated conversation data (including generated prompts)
- caption: Generated video caption text
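Because the output is plain JSON, it can be post-processed with ordinary Python. The sketch below works on an in-memory record list for illustration; in practice you would load the cache file written by FileStorage.

```python
# One output record, in the shape produced by the pipeline.
records = [
    {"video": ["./videos/sample1.mp4"],
     "conversation": [{"from": "human", "value": "Please describe the video in detail."}],
     "caption": "A person walking in a park on a sunny day."},
]

# Map each video path to its generated caption, skipping records without one.
captions = {r["video"][0]: r["caption"] for r in records if r.get("caption")}
print(captions)
```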
Output Data Example:
```json
{
    "video": ["./videos/sample1.mp4"],
    "conversation": [{"from": "human", "value": "Please describe the video in detail."}],
    "caption": "This video shows a person walking in a park on a sunny day. The weather is clear and bright, with trees and benches visible in the background. The person is wearing casual clothes and walking at a leisurely pace, seemingly enjoying a peaceful outdoor moment."
}
```

4. Pipeline Example
An example pipeline demonstrating how to use PromptedVQAGenerator combined with VideoCaptionGeneratorPrompt for video caption generation:
```python
from dataflow.operators.core_vision import PromptedVQAGenerator
from dataflow.serving import LocalModelVLMServing_vllm
from dataflow.utils.storage import FileStorage
from dataflow.prompts.video import VideoCaptionGeneratorPrompt


class VideoCaptionGenerator:
    def __init__(self):
        # -------- Storage Configuration --------
        self.storage = FileStorage(
            first_entry_file_name="./dataflow/example/video_caption/sample_data.json",
            cache_path="./cache",
            file_name_prefix="video_caption",
            cache_type="json",
        )
        # -------- VLM Model Service --------
        self.vlm_serving = LocalModelVLMServing_vllm(
            hf_model_name_or_path="Qwen/Qwen2.5-VL-7B-Instruct",
            hf_cache_dir='./dataflow_cache',
            vllm_tensor_parallel_size=1,
            vllm_temperature=0.7,
            vllm_top_p=0.9,
            vllm_max_tokens=2048,
            vllm_max_model_len=51200,
            vllm_gpu_memory_utilization=0.9
        )
        # -------- Prompt Template Configuration --------
        self.prompt_template = VideoCaptionGeneratorPrompt()
        # -------- Video Caption Generator Operator --------
        self.prompted_vqa_generator = PromptedVQAGenerator(
            serving=self.vlm_serving,
            system_prompt="You are a helpful assistant.",
            prompt_template=self.prompt_template
        )

    def forward(self):
        # Call PromptedVQAGenerator to generate captions
        self.prompted_vqa_generator.run(
            storage=self.storage.step(),
            input_image_key="image",
            input_video_key="video",
            input_conversation_key="conversation",
            output_answer_key="caption",
        )


if __name__ == "__main__":
    # Pipeline entry point
    model = VideoCaptionGenerator()
    model.forward()
```
