General Audio Q&A Operator

About 588 wordsAbout 2 min

2025-10-14

📘 Overview

PromptedAQAGenerator is a general-purpose prompt-based generation operator. It combines a user-provided system prompt with specific input content and invokes an audio large language model (LALM) to produce corresponding textual output. This operator is highly flexible and can be used for a wide range of text generation tasks that require customized prompts.

`init`

def __init__(self, 
            vlm_serving: VLMServingABC, 
            system_prompt: str = "You are a helpful assistant.",
            )

`init` parameters

Parameter	Type	Default	Description
`vlm_serving`	`VLMServingABC`	Required	The audio multimodal large model service instance used to perform generation.
`system_prompt`	`str`	"You are a helpful assistant."	The system prompt that defines the behavior or role of the audio multimodal large model.

Prompt Template Description

This operator does not use a fixed prompt template. Instead, it combines the system_prompt parameter and the content corresponding to the input_key in the run function to form the final prompt.

Note: For the Whisper model, we can configure the corresponding task in the system_prompt parameter. The method is as follows:

from dataflow.prompts.whisper_prompt_generator import WhisperTranscriptionPrompt

system_prompt = WhisperTranscriptionPrompt.generate_prompt(
    language=None,
    task="transcribe",
    with_timestamps=False,
)

def __init__(
    self, 
    vlm_serving: VLMServingABC, 
    system_prompt: str = system_prompt,
)

Parameter	Default	Description
`language`	`None`	The language of the audio file, used to specify the transcription or translation language. If set to None, the language is automatically detected from the audio content.
`task`	`"transcribe"`	The type of task, which can be either `"transcribe"` (for speech transcription) or `"translate"` (for speech translation).
`with_timestamps`	`False`	Whether to include timestamps in the transcription results.

`run`

def run(self, storage: DataFlowStorage, input_audio_key: str = "audio", input_conversation_key: str = "conversation", output_answer_key: str = "answer"):

Executes the main logic of the operator. It reads the input DataFrame from storage, combines the system_prompt with the input content, calls the LALM to generate results, and writes the results back to storage.

Parameters

Parameter Name	Type	Default Value	Description
`storage`	`DataFlowStorage`	Required	The data storage instance for input and output, containing the input DataFrame and output results.
`input_audio_key`	`str`	`audio`	The column name in the input DataFrame that contains the paths to the audio files.
`input_conversation_key`	`str`	`conversation`	The column name in the input DataFrame that contains the conversation.
`output_answer_key`	`str`	`answer`	The column name in the output DataFrame that will store the generated results.

🧠 Example Usage

from dataflow.operators.core_audio import PromptedAQAGenerator
from dataflow.serving import LocalModelVLMServing_vllm
from dataflow.utils.storage import FileStorage

class AQAGenerator():
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="../example_data/audio_vqa/sample_data_local.jsonl",
            cache_path="./cache",
            file_name_prefix="audio_aqa",
            cache_type="jsonl",
        )

        self.vlm_serving = LocalModelVLMServing_vllm(
            hf_model_name_or_path="/path/to/your/Qwen2-Audio-7B-Instruct",
            hf_cache_dir='./dataflow_cache',
            vllm_tensor_parallel_size=8,
            vllm_temperature=0.7,
            vllm_top_p=0.9,
            vllm_gpu_memory_utilization=0.6
        )

        self.prompt_generator = PromptedAQAGenerator(
            vlm_serving = self.vlm_serving,
            system_prompt="You are a helpful agent."
        )

    def forward(self):
        self.prompt_generator.run(
            storage = self.storage.step(),
            input_audio_key="audio",
            input_conversation_key="conversation",
            output_answer_key="answer",
        )

if __name__ == "__main__":
    # This is the entry point for the pipeline
    model = AQAGenerator()
    model.forward()

🧾 Default Output Format

Field	Type	Description
`answer`	`str`	The generated caption or transcription of the audio file.

Example Input:

{"audio": ["../example_data/audio_aqa_pipeline/test_1.wav"], "conversation": [{"from": "human", "value": "Transcribe the audio into Chinese." }]}
{"audio": ["../example_data/audio_aqa_pipeline/test_2.wav"], "conversation": [{"from": "human", "value": "Describe the sound in this audio clip." }]}

Example Output:

{"audio":["..\/example_data\/audio_aqa_pipeline\/test_1.wav"],"conversation":[{"from":"human","value":"Transcribe the audio into Chinese."}],"answer":"The audio states: '二十三家全国品牌企业市场份额已达到百分之二十三点三一'"}
{"audio":["..\/example_data\/audio_aqa_pipeline\/test_2.wav"],"conversation":[{"from":"human","value":"Describe the sound in this audio clip."}],"answer":"The audio contains the sound of a machine turning on and off repeatedly."}

generate

eval

filter

refine

generate

eval

filter

generate

eval

filter

generaterow

refine

General Audio Q&A Operator

📘 Overview

`init`

`init` parameters

Prompt Template Description

`run`

🧠 Example Usage

🧾 Default Output Format

General Audio Q&A Operator

📘 Overview

__init__

init parameters

Prompt Template Description

run

🧠 Example Usage

🧾 Default Output Format

`init`

`init` parameters

`run`