Image Region Caption Generation Pipeline: RegionCap (API Version)
2026-01-11
1. Overview
The Image Region Caption Generation Pipeline (API Version) generates detailed textual descriptions for specific regions of an image. It combines the localization capabilities of computer vision with the understanding capabilities of multimodal large models, identifying regions of interest (ROIs) in an image and producing precise natural-language annotations for them.
The pipeline accepts predefined bounding box data, visualizes the boxes on the image, and feeds the result to a VLM for caption generation.
Supported application scenarios:
- Dense captioning: generate a separate description for each of multiple objects in an image.
- Fine-grained image understanding: focus on local details of an image rather than a global description.
- Dataset augmentation: build image-text datasets with localization information.
The main stages of the pipeline are:
- Data loading: read source data containing images and bounding box information.
- Bounding box processing and visualization: process the input boxes and produce an image version with visual markers (e.g., drawn rectangles).
- Region caption generation: use a VLM to generate text descriptions for the marked image or specific regions.
2. Quick Start
Step 1: Create a new DataFlow working folder

```shell
mkdir run_dataflow
cd run_dataflow
```

Step 2: Initialize DataFlow-MM

```shell
dataflowmm init
```

You will then see:

```
api_pipelines/image_region_caption_api_pipeline.py
```

Step 3: Download the example data

```shell
huggingface-cli download --repo-type dataset OpenDCAI/dataflow-demo-image --local-dir ./example_data
```

Step 4: Configure the API Key

Set the API key environment variable in api_pipelines/image_region_caption_api_pipeline.py:

```python
import os
os.environ["DF_API_KEY"] = "your_api_key"
```

Step 5: Configure the parameters

Configure the API service and input data paths in api_pipelines/image_region_caption_api_pipeline.py:
```python
def __init__(
    self,
    first_entry_file: str = "../example_data/image_region_caption/image_region_caption_demo.jsonl",
    cache_path: str = "../cache/image_region_caption",
    file_name_prefix: str = "region_caption",
    cache_type: str = "jsonl",
    input_image_key: str = "image",
    input_bbox_key: str = "bbox",
    max_boxes: int = 10,
    output_image_with_bbox_path: str = "../cache/image_region_caption/image_with_bbox_result.jsonl",
):
    # ... storage, config, and operator setup (see the full pipeline example in Section 4) ...
    self.vlm_serving = APIVLMServing_openai(
        api_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
        model_name="gpt-4o-mini",
        image_io=None,
        send_request_stream=False,
        max_workers=10,
        timeout=1800
    )
```

Step 6: Run it

```shell
cd api_pipelines
python image_region_caption_api_pipeline.py
```

3. Data Flow and Pipeline Logic
1. Input Data
The input data typically contains an image path and an optional list of bounding boxes:
- image: path to the image file.
- bbox: list of bounding box coordinates, typically in the format [[x, y, w, h], ...].
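Judging from the sample output later in this document, the pipeline also stores a normalized form of each box, converting pixel [x, y, w, h] into relative corner coordinates. The helper below is a sketch of that conversion; the corner convention and the assumed 1536x832 image size are inferred from the demo output, not taken from the operator's source code:

```python
def normalize_bbox(bbox, img_w, img_h):
    # Convert a pixel-space [x, y, w, h] box into relative corner
    # coordinates [x1/W, y1/H, x2/W, y2/H], rounded to 3 decimals.
    x, y, w, h = bbox
    return [round(x / img_w, 3), round(y / img_h, 3),
            round((x + w) / img_w, 3), round((y + h) / img_h, 3)]

# Assuming the demo image is 1536x832 pixels, the demo box maps to the
# values seen in the pipeline's normalized_bbox field:
print(normalize_bbox([196, 104, 310, 495], 1536, 832))
# -> [0.128, 0.125, 0.329, 0.72]
```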
Input data example:

```json
{
  "image": "../example_data/image_region_caption/20.jpg",
  "bbox": [[196, 104, 310, 495], [50, 60, 100, 200]]
}
```

2. Core Operator Logic
The pipeline completes its task by chaining two core operators:
A. ImageBboxGenerator (bounding box processor)
This operator handles the visual side of the task.
- Input: the original image plus the bbox data.
- Function: reads the bounding boxes and draws them on the image (visualization), or preprocesses them according to the configuration.
- Configuration (ExistingBBoxDataGenConfig): controls the maximum number of boxes (max_boxes) and the input/output paths.
- Output: the path of a JSON file referencing the newly annotated images.
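The visualization step can be pictured with a few lines of Pillow. This is an illustrative sketch of what the operator produces, not its actual implementation; the label text, color, and line width are assumptions:

```python
from PIL import Image, ImageDraw

def draw_bboxes(image_path, bboxes, out_path, max_boxes=10):
    # Draw up to max_boxes [x, y, w, h] boxes on the image, tagging each
    # one with a <regionN> label, and save the annotated copy.
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, (x, y, w, h) in enumerate(bboxes[:max_boxes], start=1):
        draw.rectangle([x, y, x + w, y + h], outline="red", width=3)
        draw.text((x + 4, y + 4), f"<region{i}>", fill="red")
    img.save(out_path)
    return out_path
```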
B. PromptedVQAGenerator (VQA generator)
This operator uses the VLM to generate text.
- Input: the output of the previous step.
- Function: the VLM receives the marked image and generates a description of each region according to the prompt.
- Output: the region caption text.
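The prompt handed to the VLM follows a fixed template, as seen in the output example later in this section. A minimal reconstruction of that template:

```python
def build_region_prompt(num_regions):
    # Reproduce the prompt format observed in the pipeline's output:
    # "There are N regions: <region1> to <regionN>."
    return (f"Describe the content of each marked region in the image. "
            f"There are {num_regions} regions: "
            f"<region1> to <region{num_regions}>.")

print(build_region_prompt(1))
# -> Describe the content of each marked region in the image. There are 1 regions: <region1> to <region1>.
```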
3. Output Data
The final output contains the processed image path and the generated captions:
- image: the input image path.
- type: whether bounding boxes were provided.
- bbox: the bounding box parameters.
- normalized_bbox: the normalized bounding box parameters.
- result_file: the result output path.
- image_with_bbox: the path of the image with boxes drawn on it.
- valid_bboxes_num: the number of valid bounding boxes.
- prompt: the prompt received by the VLM.
- answer: the list of generated region captions.
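Once the pipeline has run, the result JSONL can be inspected with a few lines of Python. A minimal sketch that pairs each annotated image with its caption, using the field names from the schema above:

```python
import json

def load_region_captions(result_path):
    # Read the pipeline's JSONL output and keep the fields needed to
    # pair each annotated image with its generated caption.
    records = []
    with open(result_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            records.append({
                "image_with_bbox": rec["image_with_bbox"],
                "valid_bboxes_num": rec["valid_bboxes_num"],
                "answer": rec["answer"],
            })
    return records
```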
Output data example:

```json
{
  "image": "../example_data/image_region_caption/20.png",
  "type": "with_bbox",
  "bbox": [[196, 104, 310, 495]],
  "normalized_bbox": [[0.128, 0.125, 0.329, 0.72], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
  "result_file": "../cache/image_region_caption",
  "image_with_bbox": "../cache/image_region_caption/2_bbox_vis.jpg",
  "valid_bboxes_num": 1,
  "prompt": "Describe the content of each marked region in the image. There are 1 regions: <region1> to <region1>.",
  "answer": "In <region1>, the focus is on the lower half of a person wearing high-heeled shoes with an ornate design. The setting appears to be a kitchen, with items such as a table with floral tablecloth, a broom, and various kitchen utensils visible in the background. The legs of another person can also be seen, indicating there may be interaction happening in this domestic space. The overall scene captures a domestic and casual atmosphere."
}
```

4. Pipeline Example
Below is the complete implementation of ImageRegionCaptionAPIPipeline.
```python
import os
os.environ["DF_API_KEY"] = "sk-xxxx"

from dataflow.operators.core_vision.generate.image_bbox_generator import (
    ImageBboxGenerator,
    ExistingBBoxDataGenConfig
)
from dataflow.operators.core_vision.generate.prompted_vqa_generator import (
    PromptedVQAGenerator
)
from dataflow.utils.storage import FileStorage
from dataflow.serving.api_vlm_serving_openai import APIVLMServing_openai


class ImageRegionCaptionPipeline:
    def __init__(
        self,
        first_entry_file: str = "../example_data/image_region_caption/image_region_caption_demo.jsonl",
        cache_path: str = "../cache/image_region_caption",
        file_name_prefix: str = "region_caption",
        cache_type: str = "jsonl",
        input_image_key: str = "image",
        input_bbox_key: str = "bbox",
        max_boxes: int = 10,
        output_image_with_bbox_path: str = "../cache/image_region_caption/image_with_bbox_result.jsonl",
    ):
        # Storage for the bounding-box stage (reads the source JSONL)
        self.bbox_storage = FileStorage(
            first_entry_file_name=first_entry_file,
            cache_path=cache_path,
            file_name_prefix=file_name_prefix,
            cache_type=cache_type
        )
        self.cfg = ExistingBBoxDataGenConfig(
            max_boxes=max_boxes,
            input_jsonl_path=first_entry_file,
            output_jsonl_path=output_image_with_bbox_path,
        )
        # Storage for the captioning stage (reads the annotated-image JSONL)
        self.caption_storage = FileStorage(
            first_entry_file_name=output_image_with_bbox_path,
            cache_path=cache_path,
            file_name_prefix=file_name_prefix,
            cache_type=cache_type
        )
        self.vlm_serving = APIVLMServing_openai(
            api_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # any API platform compatible with the OpenAI format
            model_name="gpt-4o-mini",
            image_io=None,
            send_request_stream=False,
            max_workers=10,
            timeout=1800
        )
        self.bbox_generator = ImageBboxGenerator(config=self.cfg)
        self.caption_generator = PromptedVQAGenerator(
            serving=self.vlm_serving,
            system_prompt="You are a helpful assistant."
        )
        self.input_image_key = input_image_key
        self.input_bbox_key = input_bbox_key
        self.bbox_record = None

    def forward(self):
        # Stage 1: draw the input boxes onto the images
        self.bbox_generator.run(
            storage=self.bbox_storage.step(),
            input_image_key=self.input_image_key,
            input_bbox_key=self.input_bbox_key
        )
        # Stage 2: generate region captions from the annotated images
        self.caption_generator.run(
            storage=self.caption_storage.step(),
            input_image_key='image_with_bbox',
            input_prompt_key='prompt'
        )


if __name__ == "__main__":
    pipe = ImageRegionCaptionPipeline()
    pipe.forward()
```
