Function Call Data Synthesis Operators

About 837 wordsAbout 3 min

2025-07-20

Overview

Function call data synthesis operators are designed to synthesize structured function call data from dialogues or real-world task descriptions. These operators cover scenario extraction and expansion, task generation and validation, function generation, and multi-agent multi-turn conversation generation.

All related operators are located in dataflow/operators/conversations/func_call_operators.py. The table below summarizes their applicable scenarios:

Name	Type	Description	Repo or Paper
ScenarioExtractor	Scenario Extraction	Extracts scenario descriptions from conversations using LLM.	Data Paper
ScenarioExpander	Scenario Expansion	Generates alternative scenarios based on original ones using LLM.
AtomTaskGenerator	Task Generation	Generates atomic tasks from scenario descriptions using LLM.
SequentialTaskGenerator	Task Generation	Generates subsequent tasks and composes them into sequential tasks.
ParaSeqTaskGenerator	Task Generation	Generates parallel and subsequent tasks and combines them with the original task.
CompositionTaskFilter	Task Filtering	Validates compositional tasks and filters out incomplete ones using LLM.
FunctionGenerator	Function Generation	Generates function definitions for a given task composition and its subtasks.
MultiTurnConversationGenerator	Dialogue Generation	Generates multi-turn conversations with User, Assistant, and Tool agents based on tasks and functions.

Operator Details

1. ScenarioExtractor ✨

Description:
Extracts concise task scenario descriptions from dialogue using an LLM.

Parameters:

__init__()
- llm_serving: LLM interface instance
run()
- storage: data storage interface
- input_chat_key: field name for conversation input
- output_key: output field name (default: "scenario")

Highlights:

Strong contextual understanding
Forms basis for downstream task generation
Supports batch processing

2. ScenarioExpander ✨

Description:
Expands extracted task scenarios to generate varied alternatives via LLM.

Parameters:

__init__()
- llm_serving: LLM interface instance
run()
- storage: data storage interface
- input_scenario_key: field name of original scenario
- output_key: output field name (default: "modified_scenario")

Highlights:

Enhances scenario diversity
Useful for data augmentation

3. AtomTaskGenerator ✨

Description:
Generates fine-grained atomic tasks from a given scenario.

Parameters:

__init__()
- llm_serving: LLM interface instance
run()
- storage: data storage interface
- input_scenario_key: field name for scenario input
- output_key: output field name (default: "atom_task")

Highlights:

Atomic-level task granularity
Task decomposition from scenario

4. SequentialTaskGenerator ✨

Description:
Creates follow-up tasks and combines them with atomic tasks into a sequential flow.

Parameters:

__init__()
- llm_serving: LLM interface instance
run()
- storage: data storage interface
- input_task_key: field name for atomic task
- output_subsequent_task_key: subsequent task field (default: "subsequent_task")
- output_composition_task_key: composed task field (default: "composition_task")

Highlights:

Supports multi-step task flow generation
Clear structure and traceability

5. ParaSeqTaskGenerator ✨

Description:
Generates parallel and sequential extensions for an atomic task and composes them into a complex task.

Parameters:

__init__()
- llm_serving: LLM interface instance
run()
- storage: data storage interface
- input_task_key: atomic task field
- output_parallel_task_key: parallel task field (default: "parallel_task")
- output_subsequent_task_key: subsequent task field (default: "subsequent_task")
- output_composition_task_key: composed task field (default: "composition_task")

Highlights:

Multi-dimensional task modeling
Captures concurrency and sequencing

6. CompositionTaskFilter ✨

Description:
Validates if a composed task is logically complete and executable. Filters invalid or incoherent compositions.

Parameters:

__init__()
- llm_serving: LLM interface instance
run()
- storage: data storage interface
- input_composition_task_key: composed task field
- input_sub_tasks_keys: list of subtask field names
- output_key: label field for executability (default: "runable_label")

Highlights:

Logical and semantic validation
Filters for downstream function generation

7. FunctionGenerator ✨

Description:
Generates structured function call specifications (name, parameters, doc) for a composed task and its subtasks.

Parameters:

__init__()
- llm_serving: LLM interface instance
run()
- storage: data storage interface
- input_composition_task_key: composed task field
- input_sub_tasks_keys: subtask field names
- output_key: output field for functions (default: "functions")

Highlights:

LLM-based function synthesis
Designed for tool/agent integration
Structured JSON-like output

8. MultiTurnConversationGenerator ✨🚀

Description:
Simulates multi-turn conversations involving User, Assistant, and Tool agents to complete the composed task via function calls.

Parameters:

__init__()
- llm_serving: LLM interface instance
run()
- storage: data storage interface
- input_task_key: composed task field
- input_sub_tasks_keys: list of subtask fields
- input_functions_key: field name for function list
- output_conversations_key: output field for conversations (default: "conversations")

Highlights:

Multi-agent interactive generation
Supports function call injection
Up to 5 full interaction rounds

For code examples, refer to the Function Call Data Synthesis Pipeline or the GitHub source file.